Optimizing Privacy-Accuracy Tradeoff for Privacy Preserving Distance-Based Classification

Authors

  • Dongjin Kim
  • Zhiyuan Chen
  • Aryya Gangopadhyay
Abstract

Privacy concerns often prevent organizations from sharing data for data mining purposes. There is a rich literature on privacy preserving data mining techniques that can protect privacy and still allow accurate mining. Many such techniques have parameters that must be set correctly to achieve the desired balance between privacy protection and the quality of mining results. However, there has been little research on how to tune these parameters effectively. This paper studies the problem of tuning the group size parameter for a popular privacy preserving distance-based mining technique: the condensation method. The contributions include: 1) a class-wise condensation method that selects an appropriate group size based on heuristics and avoids generating groups with mixed classes, and 2) a rule-based approach that uses binary search and several rules to further optimize the setting of the group size parameter. The experimental results demonstrate the effectiveness of the authors' approach.

DOI: 10.4018/jisp.2012040102

According to one study (Gartner Inc., 2007), there were 15 million victims of identity theft in 2006. Another study showed that identity theft cost U.S. businesses and customers $56.6 billion in 2005 (MacVittie, 2007). Therefore, legislation such as the Health Insurance Portability and Accountability Act (HIPAA) and the Gramm–Leach–Bliley Act (also known as the Financial Services Modernization Act of 1999) requires that the privacy of medical and financial data be protected.

There has been a rich body of work on privacy preserving data mining (PPDM) techniques; two excellent surveys can be found in (Aggarwal & Yu, 2008; Vaidya, Zhu, & Clifton, 2005). The goal of privacy preserving data mining is two-fold: to protect the privacy of the original data and, at the same time, to preserve the utility of the sanitized data (often measured by the quality of mining results). These two goals conflict with each other, because most PPDM techniques distort the original data (e.g., by adding random noise or making data values less accurate) to provide privacy protection. Obviously, the more distortion introduced, the better the privacy protection, but the lower the utility of the data.

Most proposed PPDM techniques have tunable parameters that lead to different degrees of privacy protection and data utility. Thus, these parameters need to be set correctly to achieve the optimal privacy-utility tradeoff. For example, K-anonymity is a very commonly used privacy protection model (Sweeney, 2002a) that makes K people in the data set indistinguishable from one another so that their identities will not be revealed. A number of techniques have been proposed to implement this model (Bayardo & Agrawal, 2005; LeFevre, DeWitt, & Ramakrishnan, 2005, 2006a, 2006b; Samarati, 2001; Sweeney, 2002b; Xiao & Tao, 2006). However, all these techniques must set the correct value of K: if K is too large, the data may be distorted so much that the quality of mining becomes very poor; if K is too small, the degree of privacy protection may not be sufficient.
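To make the role of K concrete, here is a minimal Python sketch (ours, not from the paper) of the K-anonymity condition itself: every combination of quasi-identifier values must be shared by at least K records. The column names and toy records are hypothetical.

```python
from collections import Counter

def is_k_anonymous(rows, quasi_identifiers, k):
    """Return True if every combination of quasi-identifier values
    appears in at least k records (the K-anonymity condition)."""
    counts = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return all(c >= k for c in counts.values())

# Hypothetical toy data: age and zip code act as quasi-identifiers.
records = [
    {"age": 34, "zip": "21250", "disease": "flu"},
    {"age": 34, "zip": "21250", "disease": "cold"},
    {"age": 51, "zip": "21043", "disease": "flu"},
]
print(is_k_anonymous(records, ["age", "zip"], 2))  # False: the (51, 21043) group has only one record
```

Raising K strengthens this guarantee but forces more generalization or suppression of the data, which is exactly the distortion-versus-utility tension described above.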
More recently, researchers have proposed several other privacy models, such as L-diversity (Machanavajjhala, Kifer, Gehrke, & Venkitasubramaniam, 2007), t-closeness (Li, Li, & Venkatasubramanian, 2007), and differential privacy (Dwork, 2006). All these models also require parameters to be set, e.g., proper values for L in the L-diversity model, t in the t-closeness model, and ε (the degree of differential privacy) in the differential privacy model. However, there has been little research on how to tune these parameters efficiently and effectively. Most existing research simply leaves the task of setting parameters to users, yet without proper guidelines, users often have trouble setting the correct parameter values. The alternative is a brute-force approach: try many possible parameter settings, examine the utility (often in terms of mining quality) and the degree of privacy protection of each setting, and select the setting with the best utility-privacy tradeoff. However, computing the utility and the degree of privacy protection requires two steps: 1) the privacy preserving technique being considered is applied to the original data set to generate a sanitized data set; 2) the data mining algorithm is executed on the sanitized data set to generate mining results. Both steps are time consuming, and the brute-force approach must repeat them for every parameter setting, which is clearly inefficient in practice.

This paper studies the problem of optimizing parameters for a popular privacy preserving technique for distance-based classification: the condensation method (Aggarwal & Yu, 2004). The major benefit of the condensation method compared to other methods is that it generates synthetic data, so it is difficult to recover the identities behind the original data. It also preserves the statistical properties of the original data, so it works well for multiple distance-based classification algorithms. The condensation method works as follows: it divides the data into clusters (groups) such that each cluster contains at least K points (individuals), and each group is then replaced with synthetic data that preserves the statistics of the original group. However, the condensation method needs an appropriate value of K (the group size) for an optimal privacy-utility tradeoff: a very small group size may not provide enough privacy protection, while a very large group size may lead to poor mining results.
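As a rough illustration of the moment-preserving idea behind condensation, the sketch below partitions the data into groups of at least K records and replaces each group with synthetic points drawn from a Gaussian with the group's mean and covariance. This is a simplification we introduce for clarity: the actual method of Aggarwal and Yu builds groups by clustering and generates synthetic points along each group's eigenvectors, and the paper's class-wise variant additionally condenses each class separately.

```python
import numpy as np

def condense(data, k, rng=None):
    """Simplified condensation sketch: split `data` (an n x d array) into
    groups of at least k records, then replace each group with synthetic
    records matching the group's mean and covariance.

    Note: sampling from a Gaussian is our simplification; the original
    condensation method generates points along each group's eigenvectors.
    """
    rng = rng or np.random.default_rng(0)
    n = len(data)
    idx = rng.permutation(n)
    # Naive grouping: consecutive chunks of size k; the last chunk absorbs
    # any remainder so that every group has at least k records.
    groups = [idx[i:i + k] for i in range(0, n, k)]
    if len(groups) > 1 and len(groups[-1]) < k:
        groups[-2] = np.concatenate([groups[-2], groups[-1]])
        groups.pop()
    synthetic = []
    for g in groups:
        pts = data[g]
        mean = pts.mean(axis=0)
        cov = np.cov(pts, rowvar=False)  # may be near-singular for tiny groups
        synthetic.append(rng.multivariate_normal(mean, cov, size=len(pts)))
    return np.vstack(synthetic)

# Toy usage: condense 100 two-dimensional points with group size K = 10.
rng = np.random.default_rng(1)
original = rng.normal(size=(100, 2))
sanitized = condense(original, k=10, rng=rng)
print(sanitized.shape)  # (100, 2): same size and similar statistics, but synthetic
```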
We have made the following contributions:

• We propose a class-wise condensation method that automatically selects an appropriate group size. This method also ensures that each group (cluster) contains records from only one class (i.e., no mixed groups), which often leads to better classification accuracy.
• We propose a rule-based approach to further optimize the group size selection. This approach uses binary search and several rules to quickly narrow down the range of group sizes; a sketch of the search skeleton follows the roadmap below.
• We conducted extensive experiments using real data sets. The results show that the class-wise algorithm leads to better accuracy without sacrificing privacy protection, and that the rule-based approach often finds optimal or near-optimal group sizes while taking much less time than the brute-force approach.

The rest of the paper is organized as follows. We first discuss related work. We then give the necessary background on the condensation method and propose the class-wise method. We will then present our rule-based approach to further optimize the group size parameter. Finally, we will report experimental results and conclude the paper.
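To give a feel for why a search-based tuner beats brute force, here is a minimal sketch of a binary search over the group size. It assumes accuracy is roughly unimodal in K, and the `evaluate` callback, which would condense the data with a given K and return classifier accuracy, is hypothetical; the additional pruning rules of the paper's rule-based approach are not reproduced here.

```python
def tune_group_size(evaluate, k_min, k_max):
    """Binary search for the group size K maximizing evaluate(k).

    Assumes evaluate(k) is roughly unimodal in k. Each call to evaluate
    is expensive (sanitize the data, then run the classifier), so only
    O(log(k_max - k_min)) iterations are needed, versus one evaluation
    per candidate K for the brute-force approach.
    """
    lo, hi = k_min, k_max
    while lo < hi:
        mid = (lo + hi) // 2
        if evaluate(mid) < evaluate(mid + 1):
            lo = mid + 1   # accuracy still rising: the best K lies to the right
        else:
            hi = mid       # accuracy flat or falling: the best K is mid or to the left
    return lo

# Toy unimodal stand-in for the expensive evaluate step, peaking at K = 40.
scores = {k: -(k - 40) ** 2 for k in range(1, 101)}
print(tune_group_size(scores.__getitem__, 1, 100))  # -> 40
```

In this toy run, the peak is found after roughly a dozen evaluations instead of one hundred; a real implementation would also cache evaluate results, since each iteration probes two neighboring values of K.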


Similar Articles

Privacy-preserving Distributed Analytics: Addressing the Privacy-Utility Tradeoff Using Homomorphic Encryption for Peer-to-Peer Analytics

Data is becoming increasingly valuable, but concerns over its security and privacy have limited its utility in analytics. Researchers and practitioners are constantly facing a privacy-utility tradeoff where addressing the former is often at the cost of the data utility and accuracy. In this paper, we draw upon mathematical properties of partially homomorphic encryption, a form of asymmetric key...


Privacy-Preserving Adversarial Networks

We propose a data-driven framework for optimizing privacy-preserving data release mechanisms toward the information-theoretically optimal tradeoff between minimizing distortion of useful data and concealing sensitive information. Our approach employs adversarially-trained neural networks to implement randomized mechanisms and to perform a variational approximation of mutual information privacy....


Privacy Preserving Frequency Mining in 2-Part Fully Distributed Setting

Recently, privacy preservation has become one of the key issues in data mining. In many data mining applications, computing frequencies of values or tuples of values in a data set is a fundamental operation repeatedly used. Within the context of privacy preserving data mining, several privacy preserving frequency mining solutions have been proposed. These solutions are crucial steps in many pri...


A New Privacy-Preserving Data Publishing Method for Improving Classification Accuracy on Anonymized Data

Data collection and storage have been facilitated by the growth of electronic services, which has led to the recording of vast amounts of personal information in the databases of public and private organizations. These records often include sensitive personal information (such as income and diseases) and must be protected from access by others. But in some cases, mining the data and extraction of knowledge from thes...


Minimax Filter: Learning to Preserve Privacy from Inference Attacks

Preserving privacy of continuous and/or high-dimensional data such as images, videos and audios, can be challenging with syntactic anonymization methods which are designed for discrete attributes. Differential privacy, which provides a more formal definition of privacy, has shown more success in sanitizing continuous data. However, both syntactic and differential privacy are susceptible to infe...



Journal:
  • IJISP

Volume 6, Issue 2

Pages 16-33

Publication date: 2012